Statistical mechanics approach to a reinforcement learning model with memory 
We introduce a two-player model of reinforcement learning with memory. Past actions of an
iterated game are stored in a memory and used to determine a player's next action. To examine
the behaviour of the model, some approximate methods are used and compared with numerical
simulations and the exact master equation. When the length of the players' memory increases to
infinity, the model undergoes an absorbing-state phase transition. The performance of the examined
strategies is checked in the Prisoner's Dilemma game. It turns out that it is advantageous to have
a large memory in symmetric games, but it is better to have a short memory in asymmetric ones.



I. INTRODUCTION 

Game theory plays an increasingly important role in many disciplines such as sociology,
economics, computer science, or even philosophy [1]. Providing a firm mathematical basis,
this theory stimulates the development of quantitative methods to study general aspects of
conflicts, social dilemmas, or cooperation. At the simplest level such situations can be
described in terms of a two-person game with two choices. In the celebrated example of such
a game, the Prisoner's Dilemma, these choices are called cooperate (C) and defect (D). The
single Nash equilibrium, where both players defect, is not Pareto optimal, and in the
iterated version of this game players might have some incentives to cooperate. However,
finding an efficient strategy even for such a simple game is a highly nontrivial, albeit
exciting, task, as evidenced by the popularity of Axelrod's tournaments [2]. These
tournaments had an unquestionable winner: the strategy tit-for-tat. Playing in a given round
what the opponent played in the previous round, tit-for-tat is a surprising match of
effectiveness and simplicity. Later on various strategies were examined: deterministic,
stochastic, or evolving in a way that mimics biological evolution. It was also shown that
some strategies perform better than tit-for-tat; as an example one can mention the strategy
called win-stay lose-shift [3]. In an interesting class of other strategies, previous actions
are stored in the memory and used to determine future actions. However, since the number of
possible previous actions increases exponentially fast with the length of memory, and a
strategy has to encode the response to each of these possibilities, the length of memory has
to be very short [4]. Such a short memory cannot detect possible longer-term patterns or
trends in the actions of the opponent.

Actually, the problem of devising an efficient strategy that would use past experience to
choose or avoid some actions is of much wider applicability, and is known as reinforcement
learning. Intensive research in this field has resulted in a number of models [5], but the
mathematical foundations and analytical insight into their behaviour seem to be less
developed. Much of the theory of reinforcement learning is based on Markov Decision
Processes, where it is assumed that the player's environment is stationary [6]. Extension of
this essentially single-player problem to the case of two or more players is more difficult,
but some attempts have already been made [7]. Urn models [8] and various buyers-sellers
models [9] were also examined in the context of reinforcement learning.
In most reinforcement learning models [10, 11] past experience is memorized only as an
accumulated payoff. Although this is an important ingredient, storing the entire sequence of
past actions can potentially be more useful in devising efficient strategies. To get a
preliminary insight into such an approach, in the present paper we introduce a model of an
iterated game between two players. A player stores in its memory the past actions of its
opponent and uses this information to determine the probability of its next action. We
formulate approximate methods to describe the behaviour of our model and compare them with
numerical simulations and the exact master equation. Let us note that numerical simulations
are the main and often the only tool in the study of reinforcement learning models. The
possibility to use analytical and sometimes even exact approaches, such as those used in the
present paper, seems to be a rare exception. Our calculations show that when the length of
memory increases to infinity, a transition between different regimes of our model takes
place that is analogous to an absorbing-state phase transition [12]. Similar phase
transitions might exist in spatially extended, multi-agent systems [13]; however, in the
introduced two-player model this transition has a much different nature, namely it takes
place only in the space of memory configurations.



II. A REINFORCEMENT LEARNING MODEL 
WITH MEMORY 

In our model we consider a pair of players repeatedly playing a game such as the Prisoner's
Dilemma. Player $i$ ($i = 1, 2$) is equipped with a memory of length $l_i$, in which it
sequentially stores the last $l_i$ decisions made by its opponent. For simplicity let us
consider a game with two decisions, which we denote as C and D. An example that illustrates
a memory change in a single round of a game is shown in Fig. 1 (we will mostly examine the
symmetric case where $l_1 = l_2$, and the index $i$ denoting the player will thus be dropped).

FIG. 1: Memory change during a single round of a game with two players with memories of
length $l = 5$. The first player shifts all memory cells to the right (removing the rightmost
element) and puts the last decision (D) of the second player at the left end. An analogous
change takes place in the memory of the second player.

A player uses the information in its memory to evaluate the opponent's behaviour and to
calculate the probabilities of making its own decisions. Having in mind a possible
application to the Prisoner's Dilemma, we make a player's eagerness to cooperate depend on
the frequency of cooperation of its opponent. More specifically, we assume that the
probability $p_t$ for a player to play C at time $t$ is given by

$$p_t = 1 - a \exp(-b n_t / l), \tag{1}$$

where $n_t$ is the number of C's in the player's memory at time $t$, while $a > 0$ and
$b > 0$ are some additional parameters. In principle $a$ can take any value such that
$0 < a \le 1$, but the numerical calculations presented below were made only for $a = 1$,
which leaves us with only two control parameters, namely $b$ and $l$, that determine the
behaviour of the model. For $a = 1$ the model has an interesting absorbing state: provided
that both players have $n_t = 0$, they both have $p_t = 0$ and thus they will be forever
trapped in this (noncooperative) state. As we will see, in the limit $l \to \infty$ this
feature leads to a kind of phase transition (already in the case of two players).

The content of the memory might in principle provide much more valuable information on the
opponent's behaviour than Eq. (1), which is only one of the simplest possibilities. As we
already mentioned, our choice of the cooperation probability (1) was motivated by the
Prisoner's Dilemma, but of course for other games different expressions might be more
suitable. Moreover, more sophisticated expressions, for example based on some trends in the
distribution of C's, might lead to more efficient strategies, but such a possibility is not
explored in the present paper.

Let us also note that in our approach the memory of a player stores the sequence of past
actions of length $l$ (and that information is used to calculate the probability of
cooperation). We do not store the response to each possible past sequence of actions (as,
e.g., in [4]), and that is why memory requirements in our model increase only linearly with
$l$ and not exponentially.

A. Mean-value approximation

Despite a simple formulation the analysis of the model
is not entirely straightforward. This is mainly because the probability $p_t$ is actually a
random variable that depends on the dynamically determined content of a player's memory.
However, some simple arguments can be used to determine the evolution of $p_t$, at least for
large $l$. Indeed, in such a case one might expect that fluctuations of $n_t/l$ are
negligible and that it might be replaced in Eq. (1) with its mean value. Since at time $t$
the coefficient $n_t$ of player (1) equals the number of C's played by its opponent (2)
during the $l$ previous steps, we obtain the following expression for its mean value

$$\langle n_t^{(1)} \rangle = \sum_{k=1}^{l} p_{t-k}^{(2)}, \tag{2}$$

where the upper indices denote the players. Under such an assumption the evolution of the
probabilities $p_t^{(1,2)}$ is given by the following equations

$$p_t^{(1,2)} = 1 - \exp\left(-\frac{b}{l} \sum_{k=1}^{l} p_{t-k}^{(2,1)}\right), \qquad t = l+1, l+2, \ldots \tag{3}$$

In Eq. (3) we assume that both players are characterized by the same values of $b$ and $l$,
but the generalization to the case where these parameters are different is straightforward.
To iterate Eq. (3) we have to specify $2l$ initial values. For the symmetric choice

$$p_t^{(1)} = p_t^{(2)}, \qquad t = 1, 2, \ldots, l, \tag{4}$$

we obtain symmetric solutions (i.e., with Eq. (4) being satisfied for any $t$). In such a
case the upper indices in Eq. (3) can be dropped.
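
The symmetric mean-value recurrence is easy to iterate numerically. The following is a minimal sketch, assuming the $a = 1$ cooperation probability $p_t = 1 - \exp(-b n_t / l)$ with $n_t$ replaced by the sum of the last $l$ cooperation probabilities; the function name `mean_value_trajectory` and the constant initial condition `p_init` are illustrative choices, not part of the original calculations.

```python
import math

def mean_value_trajectory(b, l, p_init, steps):
    """Iterate p_t = 1 - exp(-(b/l) * sum_{k=1..l} p_{t-k}) (symmetric case).

    p_init is the common initial value p_1 = ... = p_l (illustrative choice).
    Returns the list [p_1, ..., p_{l+steps}].
    """
    p = [p_init] * l
    for _ in range(steps):
        # Mean number of C's in memory is the sum of the last l probabilities.
        mean_n = sum(p[-l:])
        p.append(1.0 - math.exp(-b * mean_n / l))
    return p

# Example: b = 2, l = 40, starting from p_t = 0.7 for t = 1, ..., l.
trajectory = mean_value_trajectory(b=2.0, l=40, p_init=0.7, steps=2000)
print(trajectory[-1])
```

For $b > 1$ the sequence settles at a positive value, while for $b < 1$ it decays towards zero.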

For large $l$ the mean-value approximation (3) is quite accurate. Indeed, numerical
calculations show that already for $l = 40$ this approximation is in very good agreement
with Monte Carlo simulations (Fig. 2). However, for smaller $l$ a clear discrepancy can be
seen.
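
A Monte Carlo simulation of the model can be sketched in a few lines. The version below is only an illustration under stated assumptions: it takes $a = 1$, so that a player with $n_t$ C's in a memory of length $l$ plays C with probability $1 - \exp(-b n_t / l)$, and it fills the initial memories with independent C's of probability `p_init`. The function name and its parameters are hypothetical.

```python
import math
import random
from collections import deque

def simulate_cooperation(b, l, p_init, steps, runs, seed=0):
    """Monte Carlo estimate of the cooperation probability at each step.

    Each player's memory holds the opponent's last l decisions (True = C),
    most recent on the left. Initial cells are C with probability p_init
    (an illustrative initial condition).
    """
    rng = random.Random(seed)
    coop = [0] * steps
    for _ in range(runs):
        mem = [deque((rng.random() < p_init for _ in range(l)), maxlen=l)
               for _ in range(2)]
        for t in range(steps):
            # Cooperation probability from the number of C's in own memory.
            probs = [1.0 - math.exp(-b * sum(m) / l) for m in mem]
            moves = [rng.random() < q for q in probs]
            coop[t] += moves[0] + moves[1]
            # Store the opponent's move on the left; the rightmost cell drops out.
            mem[0].appendleft(moves[1])
            mem[1].appendleft(moves[0])
    return [c / (2 * runs) for c in coop]

estimate = simulate_cooperation(b=2.0, l=24, p_init=0.7, steps=100, runs=500)
print(estimate[-1])
```

Starting both memories with no C's at all (`p_init=0`) keeps the players in the absorbing state, so the estimate is zero at every step.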

Provided that in the limit $t \to \infty$ the system reaches a steady state ($p_t = p$), in
the symmetric case we obtain

$$p = 1 - \exp(-bp). \tag{5}$$

Elementary analysis shows that for $b \le 1$ the only solution of (5) is $p = 0$, and for
$b > 1$ there is also an additional positive solution. Such behaviour typically describes a
phase transition at the mean-field level, but further discussion of this point will be
presented at the end of this section.
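
The threshold at $b = 1$ can be seen by expanding the right-hand side of the steady-state condition for small $p$: $1 - \exp(-bp) \approx bp - b^2 p^2 / 2$, so a small positive solution $p \approx 2(b-1)/b^2$ exists only for $b > 1$. The sketch below locates the positive solution by bisection; the function name `steady_state` and the choice of bisection are ours, and any root finder would serve equally well.

```python
import math

def steady_state(b, tol=1e-12):
    """Largest solution of p = 1 - exp(-b * p) on the interval [0, 1]."""
    if b <= 1.0:
        return 0.0  # only the trivial (noncooperative) solution exists

    def g(p):
        return p - 1.0 + math.exp(-b * p)

    # g is convex with g(0) = 0 and g'(0) = 1 - b < 0, so it has exactly one
    # positive root. g((b-1)/b^2) < 0 and g(1) = exp(-b) > 0 bracket it.
    lo, hi = (b - 1.0) / b**2, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for b in (0.5, 1.1, 1.5, 2.0, 4.0):
    print(b, steady_state(b))
```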



B. Independent-decisions approximation 

As we already mentioned, the mean-value approximation (3) neglects fluctuations of $n_t$
around its mean value. In this subsection we try to take them into account. Let us note that
a player with memory length $l$ can be in one of $2^l$ configurations (conf). Provided that
we can calculate the probability $p_{\mathrm{conf}}$ of being in such a configuration (at
time $t$), we can write

$$p_t = \sum_{\mathrm{conf}} \left[1 - \exp\left(-\frac{b\, n(\mathrm{conf})}{l}\right)\right] p_{\mathrm{conf}}, \tag{6}$$

where $n(\mathrm{conf})$ is the number of C's in a given configuration conf and the
summation is over all $2^l$ configurations; indices of players are temporarily omitted. But
for a given configuration we know its sequence of C's and D's and thus its history. For
example, if at time $t$ the memory of a player (with $l = 3$) contains CDD, it means that at
time $t-1$ its opponent played C and at times $t-2$ and $t-3$ it played D (we use the
convention that the most recent elements are on the left side). Assuming that such actions
are independent, in the above example the probability of the occurrence of this sequence
might be written as $p_{t-1}(1 - p_{t-2})(1 - p_{t-3})$. Writing $p_{\mathrm{conf}}$ in such
a product form for arbitrary $l$, Eq. (6) can be written as

$$p_t^{(1,2)} = \sum_{\{E_k\}} \left[1 - \exp\left(-\frac{b\, n(\{E_k\})}{l}\right)\right] \prod_{k=1}^{l} f_{t-k}(E_k), \tag{7}$$

where the summation in Eq. (7) is over all $2^l$ configurations (sequences) $\{E_k\}$, where
$E_k = $ C or D and $k = 1, \ldots, l$. Moreover, $n(\{E_k\})$ equals the number of C's in a
given sequence and

$$f_{t-k}(E_k) = \begin{cases} p_{t-k}^{(2,1)} & \text{for } E_k = \mathrm{C}, \\ 1 - p_{t-k}^{(2,1)} & \text{for } E_k = \mathrm{D}. \end{cases} \tag{8}$$

For $l = 2$, Eq. (7) can be written as

$$p_t^{(1,2)} = p_{t-1}^{(2,1)} p_{t-2}^{(2,1)} r_2 + p_{t-1}^{(2,1)} \left(1 - p_{t-2}^{(2,1)}\right) r_1 + \left(1 - p_{t-1}^{(2,1)}\right) p_{t-2}^{(2,1)} r_1 + \left(1 - p_{t-1}^{(2,1)}\right)\left(1 - p_{t-2}^{(2,1)}\right) r_0, \tag{9}$$

where $r_k = 1 - \exp(-bk/2)$.

The number of terms in the sum of Eq. (7) increases exponentially with $l$, but numerically
one can handle calculations up to $l = 24$ to $28$. The solution of Eq. (7) is in much better
agreement with simulations than the mean-value approximation (3). For example, for $l = 24$
and $b = 2$ it essentially overlaps with simulations, while (3) clearly differs (Fig. 2).

Despite the excellent agreement seen in this case, the scheme (7) is not exact. As we
already mentioned, this is because the product form of the probability $p_{\mathrm{conf}}$
is based on the assumption that decisions at times $t-1, t-2, \ldots, t-l$ are independent,
while in fact they are not. For smaller values of $l$ the (increasing in time) difference
from the simulation data might be quite large (Fig. 2).

FIG. 2: The cooperation probability $p$ as a function of time $t$. The dashed lines
correspond to the mean-value approximation (3), while the continuous line shows the solution
of the independent-decisions approximation (7). Simulation data (□) are averages over $10^4$
independent runs. For $l = 24$ simulations and the independent-decisions approximation (7)
are in very good agreement, while the mean-value approximation (3) slightly differs. For
$l = 40$ calculations using (7) are not feasible, but for such a large $l$ a satisfactory
description is obtained using the mean-value approximation (3). Calculations for $l = 6$
show that the independent-decisions approximation deviates from simulations. Results of
approximation (3) are not presented, but in this case they differ even more from the
simulation data. The decrease of $p$ seen in the simulation data is due to the small
probability of entering an absorbing state (no cooperation). On the other hand,
approximations (3) as well as (7) predict that for $t \to \infty$ the probability $p$ tends
to a positive value. For $l = 24$ and 40 we took as initial conditions (symmetric case)
$p_t = 0.7$, $t = 1, 2, \ldots, l$, and for $l = 6$ we used $p_t = 0.5$. Initial conditions
in Monte Carlo simulations corresponded to these values.
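
The independent-decisions scheme can be illustrated by a direct enumeration of all $2^l$ memory configurations. The sketch below treats the symmetric case with $a = 1$, weights each configuration by the product of independent per-step probabilities, and applies the cooperation probability $1 - \exp(-b n / l)$ to it. The function names are hypothetical, and the plain enumeration is practical only for small $l$.

```python
import itertools
import math

def independent_decisions_step(history, b):
    """One step of the independent-decisions approximation (symmetric case).

    history = [p_{t-1}, p_{t-2}, ..., p_{t-l}], most recent first.
    Sums over all 2^l configurations of the memory (True = C).
    """
    l = len(history)
    p_next = 0.0
    for config in itertools.product((True, False), repeat=l):
        # Probability of this configuration, assuming independent decisions.
        weight = 1.0
        for is_c, p in zip(config, history):
            weight *= p if is_c else 1.0 - p
        p_next += (1.0 - math.exp(-b * sum(config) / l)) * weight
    return p_next

def independent_decisions_trajectory(b, l, p_init, steps):
    """Iterate the approximation from p_1 = ... = p_l = p_init."""
    history = [p_init] * l
    out = []
    for _ in range(steps):
        p_next = independent_decisions_step(history, b)
        history = [p_next] + history[:-1]
        out.append(p_next)
    return out

print(independent_decisions_trajectory(b=2.0, l=6, p_init=0.5, steps=10))
```

For $l = 2$ the enumeration has four terms and reduces to the explicit expression with $r_k = 1 - \exp(-bk/2)$.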



C. Master equation 

In this subsection we present the exact master equation of this system. This equation
follows directly from the stochastic rules of the model and describes the evolution of the
probabilities of the system being in a given state. Let us note that a state of the system
is given by specifying the memory content of both agents. In the following we present the
explicit form of this equation only in the case $l = 2$; an extension to larger $l$ is
straightforward but tedious. We denote by $p_t^{EF,GH}$ the occupation probability of being
at time $t$ in the state where the first player has in its memory the values E, F and the
second one has G and H. Assuming that the parameters $b$ and $l$ are the same for both
players, the use of symmetric initial conditions

$$p_0^{EF,GH} = p_0^{GH,EF} \tag{10}$$

enables us to reduce the number of equations from 16 to 10. The resulting equations preserve
the symmetry (10) for any $t$ and are the same for each of the players. The master equation
of our model for $t = 1, 2, \ldots$ takes the following form

Iterating Eq. (11) one can calculate all occupation probabilities $p_t^{EF,GH}$. The result
can be used to obtain the probability of cooperating at time $t$

FIG. 3: The cooperation probability as a function of time $t$ for two players with $l = 2$.
The exact master equation solution (11)-(12) (solid line) is in perfect agreement with
simulations (□) and deviates from the independent-decisions approximation (9) (dotted line).

FIG. 4: The steady-state cooperation probability $p$ as a function of $b$. The
independent-decisions approximation (7) for increasing $l$ converges to the mean-value
approximation (5), which in the limit $l = \infty$ presumably becomes exact. In the
asymmetric case the cooperation probability of each player is different. The first player
(pl-1) has the memory length $l_1 = 1$ and the second player (pl-2) has $l_2 = 10^3$ or
$3 \cdot 10^3$.



For $b = 2$ and 4 the numerical results are presented in Fig. 3. One can see that they are
in perfect agreement with simulations. Let us note that for $b = 2$, after a small initial
increase, the cooperation probability $p_t$ decreases in time. This is an expected feature
and is caused by the existence of the absorbing state DD,DD. Of course, the equations (11)
reflect this fact: the probability $p_{t-1}^{DD,DD}$ enters only the last equation, namely
that describing the evolution of $p_t^{DD,DD}$ (in other words, no other state can be
reached from this state). Although on a larger time scale $p_t$ would also decrease for
$b = 4$, on the examined time scale it seems to saturate at a positive value. Solutions
(i.e., $p_t$) obtained from the independent-decisions approximation as well as from the
mean-value approximation saturate at some positive values in the limit $t \to \infty$ and
thus approximately correspond to such quasi-stationary states.
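
The exact evolution for $l = 2$ can also be obtained numerically by building the Markov chain over all 16 memory states directly from the stochastic rules, without using the symmetry reduction to 10 equations. The sketch below is our own compact formulation, not the explicit set of equations of the text: a state is a pair of memories (most recent decision first), a player with $n$ C's in memory plays C with probability $r_n = 1 - \exp(-bn/2)$, and each player then stores the opponent's decision. The function names and the independent-cell initial distribution are assumptions.

```python
import itertools
import math

def r(n, b, l=2):
    """Probability of playing C with n C's in a memory of length l (a = 1)."""
    return 1.0 - math.exp(-b * n / l)

def initial_distribution(p0):
    """Independent memory cells, each C with probability p0 (an assumption)."""
    cells = list(itertools.product((True, False), repeat=2))
    dist = {}
    for m1 in cells:
        for m2 in cells:
            w = 1.0
            for is_c in m1 + m2:
                w *= p0 if is_c else 1.0 - p0
            dist[(m1, m2)] = w
    return dist

def master_step(dist, b):
    """Advance the occupation probabilities of all 16 states by one round."""
    new = {state: 0.0 for state in dist}
    for (m1, m2), prob in dist.items():
        c1, c2 = r(sum(m1), b), r(sum(m2), b)  # P(C) for players 1 and 2
        for a1, a2 in itertools.product((True, False), repeat=2):
            w = (c1 if a1 else 1.0 - c1) * (c2 if a2 else 1.0 - c2)
            # Player 1 stores player 2's move a2, and vice versa.
            new[((a2, m1[0]), (a1, m2[0]))] += prob * w
    return new

def cooperation_probability(dist, b):
    """Probability that player 1 plays C, given the state distribution."""
    return sum(prob * r(sum(m1), b) for (m1, _), prob in dist.items())

dist = initial_distribution(0.5)
for t in range(1, 21):
    dist = master_step(dist, b=2.0)
    print(t, cooperation_probability(dist, 2.0))
```

In this formulation the state with both memories equal to DD is absorbing: its probability never decreases, and once all the weight is there the cooperation probability is zero.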

The (quasi-)stationary behaviour of the model is presented in Fig. 4. Provided that $b$ is
large enough, the players remain in the cooperative phase; otherwise they enter the
absorbing (noncooperative) state. However, for finite memory length $l$ the cooperative
state is only a transient state, and after a sufficiently long time an absorbing state will
be reached. Thus, strictly speaking, a phase transition between the cooperative and
noncooperative regimes takes place only in the limit $l \to \infty$. In this limit the
mean-value approximation (5) correctly describes the behaviour of the model. Simulations
agree with (5), but to obtain good agreement for $b$ close to the transition point value
$b = 1$, the length of memory $l$ should be large.

We have also examined the nonsymmetric (with respect to the memory length) case, where the
first player has a memory of finite length $l_1$ and the length of the memory of the second
player $l_2$ diverges. Simulations for $l_1 = 1$ and $l_2 = 10^3$ and $3 \cdot 10^3$ show
that in this case there is also a phase transition (Fig. 4), but at a larger value of $b$
than in the symmetric case (apparently, fluctuations due to the short memory of the first
player ease the approach to an absorbing state). Results for larger values of $l_1$ (not
presented) show that this transition approaches the phase transition in the symmetric case.

The phase transition shown in Fig. 4 is an example of an absorbing-state phase transition,
with the cooperative and noncooperative phases corresponding to the active and absorbing
phases, respectively [12]. Such transitions also appear for some models of the Prisoner's
Dilemma (or other games) in spatially extended systems [13], i.e., the phase transition
appears in the limit when the number of players increases to infinity. In the present model
the nature of this transition is much different: the number of players remains finite (and
equal to two) but the length of memory diverges.

                 player 2: C    player 2: D
player 1: C         (3,3)          (0,5)
player 1: D         (5,0)          (1,1)

FIG. 5: The payoff matrix of the Prisoner's Dilemma game used in the calculations presented
in Figs. 6 and 7. The first and the second number of a pair in a given cell denote the
payoffs of the first and the second player, respectively.



III. PRISONER'S DILEMMA 

In this section we examine our players in an explicit example of the Prisoner's Dilemma with
the typically used payoff matrix shown in Fig. 5. Results of the calculations of the time
dependence of the average payoff are presented in Figs. 6 and 7. Simulations in the
symmetric case (Fig. 6) show that the larger the memory length $l$, the larger the payoff.
In the asymmetric case (Fig. 7) the shorter-memory player has a larger payoff for large $t$,
but initially it might have a smaller payoff than the longer-memory player. In the
simulations shown in Figs. 6 and 7 the memory length was rather short and the model
relatively quickly enters the absorbing (noncooperative) state. That is why the average
payoff converges asymptotically to unity. Although this is not shown, such behaviour was
also seen in the asymmetric case, but on a larger time scale than that presented in Fig. 7.
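
The average payoff per round can be estimated with a short simulation. The sketch below is illustrative only: it assumes the $a = 1$ cooperation probability $1 - \exp(-b n_t / l)$, the payoffs (3,3), (0,5), (5,0), (1,1) of the payoff matrix, and initial memory cells that are C with probability `p_init`. The function name and its parameters are hypothetical.

```python
import math
import random
from collections import deque

# Payoffs (player 1, player 2) indexed by (move of 1, move of 2); True = C.
PAYOFF = {(True, True): (3, 3), (True, False): (0, 5),
          (False, True): (5, 0), (False, False): (1, 1)}

def average_payoffs(b, l1, l2, p_init, steps, runs, seed=0):
    """Average payoff of each player at every round, over independent runs."""
    rng = random.Random(seed)
    totals = [[0.0, 0.0] for _ in range(steps)]
    for _ in range(runs):
        # Memory of each player: opponent's decisions, most recent on the left.
        mem = [deque((rng.random() < p_init for _ in range(l)), maxlen=l)
               for l in (l1, l2)]
        for t in range(steps):
            moves = tuple(rng.random() < 1.0 - math.exp(-b * sum(m) / len(m))
                          for m in mem)
            pay = PAYOFF[moves]
            totals[t][0] += pay[0]
            totals[t][1] += pay[1]
            mem[0].appendleft(moves[1])
            mem[1].appendleft(moves[0])
    return [(x / runs, y / runs) for x, y in totals]

# Asymmetric example: l1 = 5, l2 = 10, b = 1.5, cells initially C with prob. 0.3.
result = average_payoffs(b=1.5, l1=5, l2=10, p_init=0.3, steps=200, runs=1000)
print(result[0], result[-1])
```

Once both memories contain only D's, every round yields the payoff pair (1,1), which is why the averages approach unity at long times.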

FIG. 6: The time evolution of the average payoff in the symmetric case ($l_1 = l_2 = l$) for
$b = 1.5$ and several values of $l$. Results are averages over 10 independent runs. As an
initial state, each cell of each player's memory contains C or D with probabilities 0.3 and
0.7, respectively.

FIG. 7: The time evolution of the average payoff in the asymmetric case $l_1 = 5$ and
$l_2 = 10$. Results are averages over $10^5$ independent runs. As an initial state, each
cell of each player's memory contains C or D with probabilities 0.3 and 0.7, respectively.

Using solely the results shown in Figs. 6 and 7 it is difficult to predict the parameters
($l$, $b$) of the best (i.e., accumulating the largest payoff) player. This is because the
performance of a given player depends on the parameters of the opponent, the number of
rounds, or even the initial content of the memory. Already the length of memory alone
results in conflicting properties: it pays off to have a large memory in symmetric games
(Fig. 6), but it is better to have a short memory in asymmetric ones (Fig. 7). It would thus
be interesting to perform an Axelrod-type tournament that would make an evolutionary
selection of the winner, where the accumulated payoff of each player would determine its
fitness. It might be particularly interesting to examine a spatially extended version of
such a tournament, where the opponents of a given player would be only its neighbouring
sites. In such a tournament one can check, for example, whether spatial effects modify the
nature (i.e., universality class) of the absorbing-state phase transition. And of course, it
would be interesting to check whether in such an ensemble of players the strategy
tit-for-tat, which in our model is obtained for $l = 1$ and $b \to \infty$, will again be
invincible.
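
The tit-for-tat limit can be checked directly, assuming the cooperation probability has the form $p_t = 1 - \exp(-b n_t / l)$ (the $a = 1$ case, consistent with the steady-state condition (5)). For $l = 1$ the memory holds only the opponent's last decision, so $n_t$ is either 0 or 1:

```latex
p_t = 1 - e^{-b n_t} =
\begin{cases}
0, & n_t = 0 \ \text{(opponent played D)}, \\
1 - e^{-b} \;\longrightarrow\; 1 \quad (b \to \infty), & n_t = 1 \ \text{(opponent played C)}.
\end{cases}
```

In this limit the player deterministically repeats the opponent's previous decision, which is the tit-for-tat rule; for finite $b$ it cooperates after a C only with probability $1 - e^{-b}$.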

As a further extension one can consider playing multi-decision games. In such a case an
additional group structure might appear and the examination of the nature of cooperation
becomes much more subtle [14].



IV. CONCLUSIONS 

In the present paper we have introduced a reinforcement learning model with memory and have
analysed it using approximate methods, numerical simulations, and the exact master equation.
In the limit when the length of memory becomes infinite the model has an absorbing-state
phase transition. The objective of the paper was to develop general approaches (such as
approximate descriptions or master-equation analysis) to study such models, and that is why
a rather simple form (1) of the cooperation probability, motivated by the Prisoner's
Dilemma, was used. In some particular games more complicated functions might prove more
efficient. One can also consider storing in a player's memory some additional information
concerning, e.g., the player's own moves. Perhaps the analytical approaches that we used in
some simple examples can be adapted to such more complicated problems as well.

We also suggested that it would be desirable to perform an Axelrod-type tournament for
players with memory (as in our work), but in addition equipped with some evolutionary
abilities [15]. Such a tournament would allow us to examine the coexistence of learning and
evolution, which is an interesting subject in its own right. Better learning abilities might
influence survival and thus direct the evolution via the so-called Baldwin effect [16]. Some
connections between learning and evolution have already been examined in the game-theory
setup as well [17, 18]. For the present model a detailed insight, at least into the learning
processes, is available, and coupling them with evolutionary processes might lead to some
interesting results in this field.

Finally, let us note that decision making based on the content of memory seems to be
connected with the psychophysical relation between response and stimulus. Early attempts to
express such a relation in mathematical terms led to the so-called Weber-Fechner law.
Despite some works that reproduce this type of law, further research, perhaps using models
similar to those described in the present paper, would be desirable.